feat(bigtable): add client side metric instrumentation to basic rpcs #16712
daniel-sanche wants to merge 8 commits into main
Conversation
Code Review
This pull request integrates client-side metrics tracking into the Bigtable data client for both asynchronous and synchronous implementations. It wraps key operations—including sample_row_keys, mutate_row, check_and_mutate_row, and read_modify_write_row—with metrics collection logic and introduces a tracked_retry wrapper to monitor retry attempts. Additionally, the PR refactors system tests by consolidating fixtures into a shared SystemTestRunner class and adds new system tests specifically for metrics. Feedback focuses on regressions in retry logic where critical arguments like sleep_generator and exception_factory were omitted in the transition to tracked_retry. There are also suggestions to improve resource cleanup in test fixtures and to relax restrictive timing assertions in tests to prevent flakiness.
),
    clusters=cluster_config,
)
operation.result(timeout=240)
If operation.result(timeout=240) raises an exception (e.g., a TimeoutError), the fixture will stop execution and the delete_instance call in the teardown phase will never be reached. This can lead to leaked Bigtable instances in the test project. Consider wrapping the creation and yield in a try...finally block or ensuring cleanup happens even on creation failure.
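A minimal sketch of the suggested try/finally shape for the fixture (the fixture, client, and parameter names below are illustrative stand-ins, not the real fixture's):

```python
def instance_fixture(admin_client, instance_id, cluster_config):
    # Hypothetical fixture body: create the instance, then guarantee cleanup
    # even if waiting on the long-running operation raises (e.g. TimeoutError).
    operation = admin_client.create_instance(instance_id, clusters=cluster_config)
    try:
        operation.result(timeout=240)
        yield instance_id
    finally:
        # Reached on creation failure too, so no test instances leak.
        admin_client.delete_instance(instance_id)
```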
operation.zone
== cluster_config[operation.cluster_id].location.split("/")[-1]
)
assert operation.duration_ns > 0 and operation.duration_ns < 1e9
The assertion operation.duration_ns < 1e9 (1 second) might be too restrictive for system tests running against a live backend. Network latency or backend load could easily cause an RPC to exceed 1 second, leading to flaky tests. It is recommended to remove this upper bound or increase it significantly.
Suggested change:
- assert operation.duration_ns > 0 and operation.duration_ns < 1e9
+ assert operation.duration_ns > 0
mutianf
left a comment
some final nits, otherwise lgtm
tuple[Exception, Exception|None]:
    tuple of the exception to raise, and a cause exception if applicable
"""
exc_list = exc_list.copy()
why do we need to copy the exception list now?
This is cleaner: we wouldn't expect a factory method like this to modify its input arguments. Copying creates an isolated reference.
IIRC I think this only came up in test code though
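The point can be shown with a toy factory (the names and behavior here are illustrative, not the real exception_factory): copying first means the caller's exception history is never mutated as a side effect.

```python
def exception_factory(exc_list):
    # Copy defensively so the factory never mutates the caller's list.
    exc_list = exc_list.copy()
    cause = exc_list.pop() if exc_list else None
    return RuntimeError("operation failed"), cause

history = [ValueError("attempt 1"), ValueError("attempt 2")]
exc, cause = exception_factory(history)
# The caller's list is untouched even though the factory popped from its copy.
assert len(history) == 2
```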
# validate operation
operation = handler.completed_operations[0]
assert isinstance(operation, CompletedOperationMetric)
assert operation.final_status.value[0] == 0
why is this final_status.value[0] and not just final_status.value? or final_status.name == 'OK'?
value is a tuple of (0, "ok"). We could just do final_status.name == 'OK' if you think that's clearer though
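For reference, the status enum pairs a numeric code with a lowercase name in its value, so `.value[0]` picks the code and `.name` the canonical name. A self-contained mock of that shape (a stand-in mimicking `grpc.StatusCode`, not the real enum):

```python
import enum

class StatusCode(enum.Enum):
    # Mirrors the (int code, str name) tuple shape described above.
    OK = (0, "ok")
    ABORTED = (10, "aborted")
    UNAVAILABLE = (14, "unavailable")

final_status = StatusCode.OK
assert final_status.value == (0, "ok")   # whole tuple
assert final_status.value[0] == 0        # numeric code only
assert final_status.name == "OK"         # arguably the clearest check
```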
for i in range(num_retryable):
    attempt = handler.completed_attempts[i]
    assert isinstance(attempt, CompletedAttemptMetric)
    assert attempt.end_status.name == "ABORTED"
Carrying over my comment from before: I don't think we retry ABORTED errors for mutate_row by default, so maybe use UNAVAILABLE instead? Also a little surprised that this test passes?
We allow the user to set custom retryable errors, which is what the test is doing here. So that's why it passes.
I can't remember for sure why I chose explicit retryable errors here instead of the default; maybe just so the test makes sense even if you don't know the defaults. I could switch it to UNAVAILABLE if you prefer.
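As a sketch of what the test relies on, here is a retry predicate built from an explicit set of retryable codes (the helper and class names are illustrative, not the client's API):

```python
def retry_if_in(retryable_names):
    # Returns a predicate that retries only the explicitly listed codes,
    # regardless of what the client would retry by default.
    def predicate(exc):
        return getattr(exc, "code_name", None) in retryable_names
    return predicate

class RpcError(Exception):
    def __init__(self, code_name):
        super().__init__(code_name)
        self.code_name = code_name

# The test passes ABORTED explicitly, even though it is not retryable by default.
is_retryable = retry_if_in({"ABORTED"})
assert is_retryable(RpcError("ABORTED"))
assert not is_retryable(RpcError("UNAVAILABLE"))
```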
)
return [(s.row_key, s.offset_bytes) async for s in results]

return await tracked_retry(
where did sleep_generator and exception_factory go?
tracked_retry contains them both. We needed custom versions of sleep_generator to report backoff, and of exception_factory to report terminal errors to the metrics module. See go/bigtable-csm-python
timeout=operation_timeout,
retry=None,
)
return result.predicate_matched
this is not wrapped in tracked_retry. Will the attempt level metrics (attempt latencies, server latencies and connectivity error count) still be recorded?
check_and_mutate shouldn't have retries, right?
But yes, a single attempt will be recorded when this completes. The duration/gfe_latency data is captured, and will be exported as those metrics in the follow-up PR.
retry=None,
)
# construct Row from result
return Row._from_pb(result.row)
same question as check and mutate.
Migration of googleapis/python-bigtable#1188 to the monorepo
This PR builds off of googleapis/python-bigtable#1187 to add instrumentation to basic data client rpcs (check_and_mutate, read_modify_write, sample_row_keys, mutate_row)
Metrics are not currently being exported anywhere, just collected and dropped. A future PR will add a GCP exporter to the system